Feature/unswizzle by int-smart · Pull Request #2732 · NVIDIA/TransformerEngine

int-smart · 2026-03-04T05:09:04Z

Description

This PR adds unswizzle support for scaling factors and extends the swizzle module so scaling tensors can be converted from GEMM-swizzled layout back to compact layout, including multi-tensor paths. It also adds round-trip and standalone tests to validate unswizzle correctness.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Added unswizzle APIs and implementation in transformer_engine/common/swizzle/swizzle.cu and declarations in transformer_engine/common/include/transformer_engine/swizzle.h
Added multi-tensor unswizzle support with swizzle-like validation assumptions (homogeneous scaling mode/layout, swizzled input and compact output expectations)
Refactored multi-tensor unswizzle launch/kernels to mirror swizzle structure (split row-wise and column-wise kernels) for easier readability
Added/extended tests in tests/cpp/operator/test_swizzle.cu, including standalone unswizzle and swizzle→unswizzle round-trip coverage
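
To make the round-trip coverage in the last item concrete, here is a minimal host-side sketch of the property being tested. It is not the code in test_swizzle.cu: the names `swizzled_offset`, `ref_swizzle`, `ref_unswizzle`, and `roundtrip_ok` are hypothetical, and the index formula assumes a 128x4 scale-factor tile in which element (r, c) lands at offset (r % 32) * 16 + (r / 32) * 4 + c, with tiles stored row-major. Padding is ignored, so M must be a multiple of 128 and K a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tile geometry: 128 rows x 4 columns of one-byte scale factors.
constexpr int kTileM = 128;
constexpr int kTileK = 4;

// Offset of compact element (r, c) inside the swizzled buffer.
// Assumption: tiles are laid out row-major, 512 bytes per tile.
inline std::size_t swizzled_offset(int r, int c, int K) {
  const int tiles_per_row = K / kTileK;
  const std::size_t tile =
      static_cast<std::size_t>(r / kTileM) * tiles_per_row + c / kTileK;
  const int rr = r % kTileM;  // row inside the tile
  const int cc = c % kTileK;  // column inside the tile
  return tile * (kTileM * kTileK) + (rr % 32) * 16 + (rr / 32) * 4 + cc;
}

// Compact (row-major) -> swizzled.
std::vector<uint8_t> ref_swizzle(const std::vector<uint8_t>& compact, int M, int K) {
  std::vector<uint8_t> out(compact.size());
  for (int r = 0; r < M; ++r)
    for (int c = 0; c < K; ++c)
      out[swizzled_offset(r, c, K)] = compact[static_cast<std::size_t>(r) * K + c];
  return out;
}

// Swizzled -> compact: the same index map, read in the opposite direction.
std::vector<uint8_t> ref_unswizzle(const std::vector<uint8_t>& swizzled, int M, int K) {
  std::vector<uint8_t> out(swizzled.size());
  for (int r = 0; r < M; ++r)
    for (int c = 0; c < K; ++c)
      out[static_cast<std::size_t>(r) * K + c] = swizzled[swizzled_offset(r, c, K)];
  return out;
}

// True when unswizzle(swizzle(x)) == x for a deterministic byte pattern.
bool roundtrip_ok(int M, int K) {
  std::vector<uint8_t> x(static_cast<std::size_t>(M) * K);
  for (std::size_t i = 0; i < x.size(); ++i) x[i] = static_cast<uint8_t>(i * 7 + 3);
  return ref_unswizzle(ref_swizzle(x, M, K), M, K) == x;
}
```

Because the index map is a bijection on each tile, the inverse needs no separate formula; it reuses `swizzled_offset` with source and destination exchanged.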

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

- Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
- Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
- Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

These enhancements test the changes introduced for unswizzling.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
- Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
- Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
- Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

- Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
- Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
- Updated the launch function to handle multiple tensor unswizzling operations efficiently.

These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-03-04T05:17:43Z

Greptile Summary

This PR adds unswizzle support for MXFP8 and NVFP4 scaling factors, providing the inverse operation to the existing nvte_swizzle_scaling_factors API. It introduces nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors, along with the corresponding GPU kernels (unswizzle_row_scaling_kernel_impl, unswizzle_col_scaling_kernel_impl) and byte-level inverse shuffle helpers (regs_unshuffle, regs_unshuffle_with_bit_shifts). Tests covering standalone unswizzle and swizzle→unswizzle round-trips are also added.

Key observations:

The two new inverse shuffle functions (regs_unshuffle and regs_unshuffle_with_bit_shifts) are mathematically correct inverses of their forward counterparts for all supported LType variants (int, int2, int4).
Bug in test helpers: In both performTestUnswizzle1D and performTestSwizzleUnswizzleRoundtrip, the variables SF_MODE_X and SF_MODE_Y are uninitialized when the !(rowwise || columnwise) branch of the skip guard is taken, causing undefined behaviour in the GTEST_SKIP message.
The rowwise_swizzle/columnwise_swizzle variable names in multi_tensor_unswizzle_scaling_factors should be named rowwise_unswizzle/columnwise_unswizzle to avoid confusion about data-flow direction.
Minor: both skip messages are missing a space before "is not implemented.", and the regs_unshuffle_with_bit_shifts function body is not followed by a blank line before the next template.
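
The inverse-shuffle claim in the first observation can be sanity-checked on the host. The sketch below is a simplified stand-in rather than the kernel code: it assumes the forward shuffle is a 4x4 byte transpose across four 32-bit registers (roughly the single-tile `int` case), and the names `Regs`, `shuffle4x4`, `unshuffle4x4`, and `check_roundtrip` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Regs = std::array<uint32_t, 4>;

// Forward shuffle (gather): byte j of output word i is byte i of input word j.
Regs shuffle4x4(const Regs& in) {
  Regs out{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      out[i] |= ((in[j] >> (8 * i)) & 0xFFu) << (8 * j);
  return out;
}

// Inverse shuffle (scatter): byte j of shuffled word i goes back to
// byte i of word j, undoing the gather above.
Regs unshuffle4x4(const Regs& in) {
  Regs out{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      out[j] |= ((in[i] >> (8 * j)) & 0xFFu) << (8 * i);
  return out;
}

// Asserts that the pair is a true inverse for one register set.
void check_roundtrip(const Regs& r) {
  assert(unshuffle4x4(shuffle4x4(r)) == r);
}
```

In this single-tile simplification the transpose is its own inverse, so the two functions compute the same permutation. The wider `int2` and `int4` cases interleave several tiles and are not covered by this sketch.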

Confidence Score: 4/5

Production kernel logic is sound; two UB issues in test helpers should be fixed before merging.
The core GPU kernels and inverse shuffle helpers are mathematically correct and mirror the existing swizzle structure. The only defects found are confined to test code: two instances of reading uninitialized variables in GTEST_SKIP messages (undefined behaviour, though unlikely to cause silent data corruption in production). The remaining issues are cosmetic naming/style concerns. No functional regression risks exist in the shipped library code.
tests/cpp/operator/test_swizzle.cu — fix uninitialized SF_MODE_X/SF_MODE_Y before merging.

Important Files Changed

Filename	Overview
transformer_engine/common/swizzle/swizzle.cu	Adds `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` (verified correct inverses of their forward counterparts), two new GPU kernel `impl` functions for row- and column-wise unswizzle, a unified dispatch kernel, multi-tensor row/col unswizzle kernels, and the full `unswizzle_scaling_factors` / `multi_tensor_unswizzle_scaling_factors` host-side logic. Variable names `rowwise_swizzle`/`columnwise_swizzle` in the multi-tensor unswizzle path are misleadingly named; a missing blank line exists after `regs_unshuffle_with_bit_shifts`.
transformer_engine/common/include/transformer_engine/swizzle.h	Adds declarations for `nvte_unswizzle_scaling_factors` and `nvte_multi_tensor_unswizzle_scaling_factors` with complete Doxygen comments that accurately describe inputs, outputs, and requirements. No issues found.
tests/cpp/operator/test_swizzle.cu	Adds `compute_ref_unswizzle`, `performTestUnswizzle1D`, `performTestSwizzleUnswizzleRoundtrip`, and corresponding GTest suites/instantiations. Two separate instances of undefined behaviour: `SF_MODE_X`/`SF_MODE_Y` are read uninitialized in the GTEST_SKIP message when neither `rowwise` nor `columnwise` is set (lines 155-157 and 296-298). Also has missing spaces in both skip messages.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant nvte_unswizzle_scaling_factors
    participant unswizzle_scaling_factors
    participant unswizzle_scaling_kernel
    participant unswizzle_row_impl as unswizzle_row_scaling_kernel_impl
    participant unswizzle_col_impl as unswizzle_col_scaling_kernel_impl

    Caller->>nvte_unswizzle_scaling_factors: input (swizzled), output (compact), stream
    nvte_unswizzle_scaling_factors->>unswizzle_scaling_factors: convertNVTETensorCheck()
    unswizzle_scaling_factors->>unswizzle_scaling_factors: validate scaling_mode, dtype, shapes
    alt rowwise_unswizzle
        unswizzle_scaling_factors->>unswizzle_scaling_kernel: launch<<<grid,block,slm,stream>>>
        unswizzle_scaling_kernel->>unswizzle_row_impl: row_scaling=true
        unswizzle_row_impl->>unswizzle_row_impl: load tiles to SLM
        unswizzle_row_impl->>unswizzle_row_impl: regs_unshuffle()
        unswizzle_row_impl->>unswizzle_row_impl: write compact output
    else columnwise_unswizzle
        unswizzle_scaling_factors->>unswizzle_scaling_kernel: launch<<<grid,block,slm,stream>>>
        unswizzle_scaling_kernel->>unswizzle_col_impl: row_scaling=false
        unswizzle_col_impl->>unswizzle_col_impl: load tiles to SLM
        unswizzle_col_impl->>unswizzle_col_impl: regs_unshuffle_with_bit_shifts()
        unswizzle_col_impl->>unswizzle_col_impl: write compact output
    end
    unswizzle_scaling_factors-->>Caller: compact scale_inv in output

Last reviewed commit: 621bc16

greptile-apps · 2026-03-04T05:17:46Z

tests/cpp/operator/test_swizzle.cu

+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
+      std::to_string(SF_MODE_Y) + "is not implemented.";
+  }


Uninitialized variables used in skip message

When !(rowwise || columnwise) is true (neither flag is set), neither if (rowwise) nor if (columnwise) branch executes, leaving SF_MODE_X and SF_MODE_Y uninitialized. Passing them to std::to_string() is undefined behaviour.

The same issue exists in performTestSwizzleUnswizzleRoundtrip at line 297.

Suggested change

-  if ((rowwise && columnwise) || !(rowwise || columnwise)){
-    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
-      std::to_string(SF_MODE_Y) + "is not implemented.";
-  }
+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode is not implemented.";
+  }
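
If the block sizes should stay in the message for the supported cases, an alternative is to decide on the skip before any mode variable is read and to initialize both values on every path. The following is an illustrative sketch only: the helpers `skip_reason` and `mode_name` are hypothetical, and the 1x32 and 32x1 block sizes are assumed rather than taken from the diff.

```cpp
#include <string>

// Returns a skip reason for unsupported flag combinations, or an empty
// string when the test should run. No scaling-mode variable is read here.
std::string skip_reason(bool rowwise, bool columnwise) {
  if (rowwise && columnwise)
    return "TEST SKIPPED, combined row-wise and column-wise scaling is not implemented.";
  if (!rowwise && !columnwise)
    return "TEST SKIPPED, no scaling direction was requested.";
  return "";
}

// Formats the selected mode. Both values are initialized on every path, so
// std::to_string never sees an indeterminate value.
// Assumption: row-wise scaling uses 1x32 blocks and column-wise uses 32x1.
std::string mode_name(bool rowwise) {
  const int sf_mode_x = rowwise ? 1 : 32;
  const int sf_mode_y = rowwise ? 32 : 1;
  return std::to_string(sf_mode_x) + "x" + std::to_string(sf_mode_y);
}
```

In a test body this could be used as `const std::string reason = skip_reason(rowwise, columnwise); if (!reason.empty()) GTEST_SKIP() << reason;`, which keeps the guard and the message construction in one place for both test helpers.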

greptile-apps · 2026-03-04T05:17:47Z

tests/cpp/operator/test_swizzle.cu

+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
+      std::to_string(SF_MODE_Y) + "is not implemented.";
+  }


Uninitialized variables used in skip message (roundtrip test)

Same undefined-behaviour issue as in performTestUnswizzle1D — SF_MODE_X and SF_MODE_Y are uninitialized when !(rowwise || columnwise).

Suggested change

-  if ((rowwise && columnwise) || !(rowwise || columnwise)){
-    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
-      std::to_string(SF_MODE_Y) + "is not implemented.";
-  }
+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode is not implemented.";
+  }

greptile-apps · 2026-03-04T05:17:48Z

tests/cpp/operator/test_swizzle.cu

+
+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
+      std::to_string(SF_MODE_Y) + "is not implemented.";


Missing space in skip message

The concatenated string produces "...32is not implemented." (no space before "is"). Add a leading space.

Suggested change

-      std::to_string(SF_MODE_Y) + "is not implemented.";
+      std::to_string(SF_MODE_Y) + " is not implemented.";

greptile-apps · 2026-03-04T05:17:49Z

tests/cpp/operator/test_swizzle.cu

+
+  if ((rowwise && columnwise) || !(rowwise || columnwise)){
+    GTEST_SKIP() << "TEST SKIPPED, The scaling mode " + std::to_string(SF_MODE_X) + "x" +
+      std::to_string(SF_MODE_Y) + "is not implemented.";


Missing space in skip message (roundtrip test)

Same missing space issue — produces "...32is not implemented." without a space.

Suggested change

-      std::to_string(SF_MODE_Y) + "is not implemented.";
+      std::to_string(SF_MODE_Y) + " is not implemented.";

greptile-apps · 2026-03-04T05:17:50Z

transformer_engine/common/swizzle/swizzle.cu

+  for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
+}
 template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>


Missing blank line after function definition

regs_unshuffle_with_bit_shifts ends and the next template declaration begins immediately (no blank line). Every other function pair in this file is separated by a blank line. Add one for consistency.

Suggested change

-  for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
-}
-template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>
+  for (int i = 0; i < kVectorSize; i++) regs[i] = new_regs[i];
+}
+
+template <typename LType, int SF_TILE_DIM_M, int SF_TILE_DIM_K>

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-03-04T05:17:51Z

transformer_engine/common/swizzle/swizzle.cu

+  const bool rowwise_swizzle = all_has_data || all_nvfp4;
+  const bool columnwise_swizzle = all_has_columnwise_data && !all_nvfp4;


Misleading variable names in unswizzle function

rowwise_swizzle and columnwise_swizzle are declared inside multi_tensor_unswizzle_scaling_factors but refer to unswizzle operations, not swizzle. This can confuse future readers about the data-flow direction. Consider renaming to rowwise_unswizzle / columnwise_unswizzle to match the function's purpose.

Suggested change

-  const bool rowwise_swizzle = all_has_data || all_nvfp4;
-  const bool columnwise_swizzle = all_has_columnwise_data && !all_nvfp4;
+  const bool rowwise_unswizzle = all_has_data || all_nvfp4;
+  const bool columnwise_unswizzle = all_has_columnwise_data && !all_nvfp4;
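
For context, these two flags decide which direction the multi-tensor path handles. A host-only sketch of that selection with the suggested names is given here. Only the two boolean expressions come from the diff above; `TensorInfo`, `UnswizzlePlan`, `plan_unswizzle`, and the use of `std::all_of` to derive the `all_*` values are assumptions made for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-tensor metadata; the real code inspects tensor objects.
struct TensorInfo {
  bool has_data;             // row-wise data present
  bool has_columnwise_data;  // column-wise data present
  bool is_nvfp4;             // NVFP4 scaling mode
};

struct UnswizzlePlan {
  bool rowwise_unswizzle;
  bool columnwise_unswizzle;
};

// Derives the two flags for a homogeneous list of tensors.
// Note: std::all_of returns true for an empty range.
UnswizzlePlan plan_unswizzle(const std::vector<TensorInfo>& tensors) {
  const bool all_has_data = std::all_of(
      tensors.begin(), tensors.end(), [](const TensorInfo& t) { return t.has_data; });
  const bool all_has_columnwise_data = std::all_of(
      tensors.begin(), tensors.end(), [](const TensorInfo& t) { return t.has_columnwise_data; });
  const bool all_nvfp4 = std::all_of(
      tensors.begin(), tensors.end(), [](const TensorInfo& t) { return t.is_nvfp4; });

  // Same expressions as in the diff, with the suggested names.
  const bool rowwise_unswizzle = all_has_data || all_nvfp4;
  const bool columnwise_unswizzle = all_has_columnwise_data && !all_nvfp4;
  return {rowwise_unswizzle, columnwise_unswizzle};
}
```

With these expressions an all-NVFP4 batch always takes the row-wise path and never the column-wise one, which is the kind of data-flow detail the clearer names make easier to follow.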

int-smart and others added 6 commits March 3, 2026 20:40

Add swizzle/unswizzle roundtrip test for scaling factors

6a064cf

These enhancements test the changes introduced for unswizzling.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

621bc16

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Mar 4, 2026


vthumbe1503 added the community-contribution label (PRs from external contributors outside the core maintainers, representing community-driven work) on Mar 4, 2026.

int-smart wants to merge 6 commits into NVIDIA:main from int-smart:feature/unswizzle

2 participants
